Medical Survival Analysis Through Transduction of Semi-Supervised Regression Targets

Authors

  • Faisal M. Khan
  • Qiuhua Liu
Abstract

A crucial challenge in predictive modeling for survival analysis applications such as medical prognosis is accounting for censored observations in the data. While these time-to-event predictions inherently represent a regression problem, traditional regression approaches are challenged by the censored nature of the data. In such problems the true target times of a majority of instances are unknown; what is known is a censored target representing some indeterminate time before the true target time. While censored samples can be considered semi-supervised targets, the limited existing work in semi-supervised regression does not take into account the partial nature of this information; samples are treated as either fully labeled or unlabeled. This paper presents a novel semi-supervised learning approach in which the true target times are approximated from the censored times through transduction. The method can be used to adapt traditional regression methods to survival analysis, or to enhance existing state-of-the-art survival analysis methods for improved predictive performance. The proposed approach represents one of the first applications of semi-supervised regression to survival analysis and yields a significant improvement in performance over the state of the art in prostate and breast cancer prognosis applications.
DOI: 10.4018/jkdb.2011070104. International Journal of Knowledge Discovery in Bioinformatics, 2(3), 52-65, July-September 2011. Copyright © 2011, IGI Global.
...action of whether a potentially significant gene will continue to be relevant when combined with other factors in a multivariate setting (Donovan et al., 2009) in order to possibly prioritize and identify candidate genes for targeted therapeutic drug development.
While time-to-event prediction is inherently a regression problem, it challenges computational modeling approaches because healthcare data in such settings consist of both censored and non-censored (event) observations. Data used in such prognostic modeling are usually obtained by tracking patients over the course of a well-designed study, perhaps lasting years. In contrast to traditional regression problems, the information for most observations is incomplete and known only “up to a point.” Patients who experienced the endpoint of interest (cancer remission, recurrence, etc.) during their follow-up are considered non-censored, or events. Patients who did not experience the endpoint during the study, or who were lost to follow-up for any reason (e.g., the patient moved during a multiyear study), are considered censored. All that is known about them is that they were disease-free up to a certain point; what subsequently occurred is unknown.
For a censored patient with a d-dimensional covariate vector $x_i \in \mathbb{R}^d$, the observed time $S_i$ is called the censoring time. For such individuals, it is known only that they survived for at least time $S_i$. The actual target $T_i$ is unknown for censored cases, thus $S_i < T_i$. An important assumption is that $T_i$ and $S_i$ are independent conditional on $x_i$, i.e., the cause of censoring is independent of the survival time. With an indicator $\delta_i$ that is 0 if an event occurred and 1 if the observation is censored, the available training data for $N$ patients can be summarized as $D = \{(T_i, x_i, \delta_i)\}_{i=1}^{N}$ (Raykar et al., 2008). Censored observations contribute incomplete information, as the event of interest may occur after the patient was lost to follow-up. Simply omitting the censored observations (Burke et al., 1997; Shivaswamy, Chu, & Jansche, 2007) or treating them as non-recurring samples in a classifier (Snow, Smith, & Catalona, 1997) biases the resulting model, and both should be avoided.
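The notation above can be made concrete with a small simulation. The following is a minimal sketch, not taken from the paper: the `Observation` class, the `simulate` helper, and the exponential event and censoring distributions are all hypothetical choices for illustration. It follows the text's convention that δ = 0 marks an event and δ = 1 a censored case, and it compares the true mean survival time with two naive estimates: dropping censored cases, and using the censoring time as if it were the true target.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Observation:
    x: List[float]  # covariate vector x_i
    time: float     # recorded time: T_i for an event, S_i if censored
    delta: int      # convention from the text: 0 = event, 1 = censored


def simulate(n: int, seed: int = 0) -> Tuple[List[Observation], List[float]]:
    """Hypothetical toy cohort with exponential event and censoring times."""
    rng = random.Random(seed)
    data, true_times = [], []
    for _ in range(n):
        x = [rng.random()]                  # placeholder covariate, unused here
        t_true = rng.expovariate(1 / 10.0)  # true event time, mean 10 (assumed)
        s = rng.expovariate(1 / 5.0)        # independent censoring time, mean 5 (assumed)
        true_times.append(t_true)
        if t_true <= s:
            data.append(Observation(x, t_true, 0))  # event observed
        else:
            data.append(Observation(x, s, 1))       # right-censored: S_i < T_i
    return data, true_times


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


data, true_times = simulate(2000)
true_mean = mean(true_times)
events_only_mean = mean([o.time for o in data if o.delta == 0])
censored_as_exact_mean = mean([o.time for o in data])
censored_fraction = mean([float(o.delta) for o in data])

print(f"censored fraction:          {censored_fraction:.2f}")
print(f"true mean survival time:    {true_mean:.2f}")
print(f"mean, censored dropped:     {events_only_mean:.2f}")
print(f"mean, censored as exact:    {censored_as_exact_mean:.2f}")
```

In this toy setting both naive estimates fall well below the true mean, because the recorded time of every censored case is strictly smaller than its unobserved target.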
Additionally, in the field of healthcare diagnostics the sample size is often small, in the tens or hundreds, owing to the costs of identifying acceptable patients who will consent to inclusion in research and then actively tracking them over a significant period of time. Since most samples may be censored (e.g., 91% in prostate cancer (Donovan et al., 2008) and 76% in breast cancer (Mangasarian, Street, & Wolberg, 1994)), dropping such patients is an even less attractive option, and accounting for them is of crucial importance for a model. Such samples (event-free and lost to follow-up) are considered right-censored; their information on the right-hand side of a timeline is unknown. The problem is further confounded by the fact that non-censored patients may experience the event of interest prior to their recorded time ($S_i > T_i$); for example, a cancer patient may visit a doctor every six months, so if recurrence is observed, it happened somewhere in the six months between the last visit and the visit at which the disease was detected. The term left-censoring describes this phenomenon, where even the status of event patients is not completely known. The incomplete nature of the outcome targets in time-to-event prediction thus challenges traditional regression techniques and precludes their use. Instead, methods that correctly account for censored observations are crucial for analyzing time-to-event problems. A heretofore unrealized notion of great interest is that the censored samples prevalent in time-to-event problems can be considered semi-supervised targets.
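The six-month visit example can be worked through in a few lines. This is an illustrative sketch only: the `detection_interval` function and its fixed, evenly spaced visit schedule are assumptions for the example, not part of the paper. Given a hypothetical true recurrence time, it returns the last visit at which the patient was still disease-free and the visit at which recurrence is recorded.

```python
import math
from typing import Tuple


def detection_interval(true_time: float, visit_gap: float = 6.0) -> Tuple[float, float]:
    """Return (last clean visit, detecting visit) for an event at true_time > 0.

    Assumes visits occur at visit_gap, 2 * visit_gap, ... (hypothetical schedule).
    """
    if true_time <= 0:
        raise ValueError("true_time must be positive")
    k = math.ceil(true_time / visit_gap)  # index of the first visit at or after the event
    return (k - 1) * visit_gap, k * visit_gap


# A recurrence at 14.5 months is only detected at the 18-month visit.
last_clean, recorded = detection_interval(14.5)
print(last_clean, recorded)  # 12.0 18.0
```

Here the recorded time (18 months) exceeds the true time (14.5 months), which is the situation the text denotes by S_i > T_i: the data pin the event down only to the interval between the two visits.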
While there has been significant work on semi-supervised classification approaches, particularly for SVMs (Bennett & Demiriz, 1999; Chapelle, Sindhwani, & Keerthi, 2008; Chen, Wang, & Dong, 2003; Fung & Mangasarian, 2001; Gammerman, Vovk, & Vapnik, 1998; Kemp, Griffiths, Stromsten, & Tenenbaum, 2004; Seeger, 2006), there has been limited work on semi-supervised regression, especially SVR (Belkin, Niyogi, & Sindhwani, 2006; Cortes & Mohri, 2007; Rwebangira & Lafferty, 2008; Szummer & Jaakkola, 2000; Zhou & Li, 2005). The limited work thus far in semi-supervised...

Similar resources

Covariance Operator Based Dimensionality Reduction with Extension to Semi-Supervised Settings

We consider the task of dimensionality reduction for regression (DRR) informed by real-valued multivariate labels. The problem is often treated as a regression task where the goal is to find a low-dimensional representation of the input data that preserves the statistical correlation with the targets. Recently, Covariance Operator Inverse Regression (COIR) was proposed as an effective solution t...

Semi-supervised learning by search of optimal target vector

We introduce a semi-supervised learning estimator which tends to the first kernel principal component as the number of labeled points vanishes. We show an application of the proposed method to dimensionality reduction and develop a semi-supervised regression and classification algorithm for transductive inference.

Semi-supervised Penalized Output Kernel Regression for Link Prediction

Link prediction is addressed as an output kernel learning task through semi-supervised Output Kernel Regression. Working in the framework of RKHS theory with vector-valued functions, we establish a new representer theorem devoted to semi-supervised least-squares regression. We then apply it to get a new model (POKR: Penalized Output Kernel Regression) and show its relevance using numerical experi...

Evaluation of Survival Analysis Models for Predicting Factors Influencing the Time of Brucellosis Diagnosis

Background: Brucellosis, or Malta fever, is one of the most common zoonotic diseases in the world. Given the human suffering and dire economic impact on animals that it causes, and the high prevalence of brucellosis in the western regions of Isfahan province, this study aimed to analyze the factors affecting the time of brucellosis diagnosis using parametric and semi-parametric mo...

Survival analysis of breast cancer patients with different chronic diseases through parametric and semi-parametric approaches

Introduction: There is a lack of information on the extent of dependency between chronic diseases and the survival rate of breast cancer. To date, none of the proposed models has determined the impact of chronic diseases on breast cancer survival. This study, therefore, aimed to investigate the impacts of chronic diseases such as diabetes, blood pressure, and endocrine di...

Journal title:
  • IJKDB

Volume 2, Issue 3

Pages 52-65

Publication date 2011